Naive Bayes and Map-Reduce

Author

  • William Cohen
Abstract

We’ll start out with a very simple learning algorithm: multinomial Naive Bayes. Our implementation is in Table 1. Each training example is a labeled document d = (i, y, (w1, . . . , wni)) with an identifier i, a label y from a small set Y = {y1, . . . , yK}, and a “bag of words”. The bag of words consists of the wj’s, encoded here as a list of strings, so that wj is the word/token at position j of document i. For example, the bag of words for the paragraph below would be the strings “when”, “scaling”, “to”, . . . , “large”, “datasets”, and “:”. The test documents have the same form, but their labels are unknown. Our goal is to read a training set, build a classifier, and then predict a label for each test example.

When scaling to large datasets, the first thing to remember is that main memory (RAM) is, relatively speaking, limited and expensive, while disk space is less limited and cheap. So we want to carefully control how much memory we use. In particular, if you care about processing large datasets:
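The setup described in the abstract (labeled bag-of-words documents, a training pass, then prediction) can be illustrated with a minimal Python sketch. This is not the implementation from Table 1, which is not reproduced here. The function names `train` and `predict`, the smoothing parameter `alpha`, and the toy documents are all assumptions made for illustration. The sketch also keeps every count in memory, which is exactly the kind of memory use the abstract says must be controlled on large datasets.

```python
import math
from collections import Counter, defaultdict


def train(examples):
    """Count label and (label, word) events in one pass over the training set.

    Each example is a tuple (doc_id, label, words), mirroring d = (i, y, (w1, ..., wn)).
    """
    label_counts = Counter()            # number of training documents per label y
    word_counts = defaultdict(Counter)  # word_counts[y][w] = occurrences of w under label y
    vocab = set()                       # all distinct words seen in training
    for _doc_id, label, words in examples:
        label_counts[label] += 1
        word_counts[label].update(words)
        vocab.update(words)
    return label_counts, word_counts, vocab


def predict(model, words, alpha=1.0):
    """Return the label y maximizing log P(y) + sum_j log P(w_j | y).

    alpha is an add-alpha (Laplace) smoothing constant; it is an assumption
    of this sketch, not something specified in the abstract.
    """
    label_counts, word_counts, vocab = model
    n_docs = sum(label_counts.values())
    best_label, best_score = None, -math.inf
    for label, n_label in label_counts.items():
        total = sum(word_counts[label].values())
        score = math.log(n_label / n_docs)
        for w in words:
            numerator = word_counts[label][w] + alpha
            denominator = total + alpha * len(vocab)
            score += math.log(numerator / denominator)
        if score > best_score:
            best_label, best_score = label, score
    return best_label


# Hypothetical toy data, invented for this example only.
train_docs = [
    (1, "sports", ["goal", "match", "team"]),
    (2, "sports", ["team", "win", "match"]),
    (3, "tech", ["disk", "memory", "code"]),
    (4, "tech", ["memory", "cache", "code"]),
]
model = train(train_docs)
prediction = predict(model, ["team", "goal"])
```

Working in log space avoids numeric underflow when many small probabilities are multiplied, and the smoothing keeps a word unseen under one label from driving that label's score to negative infinity.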

Similar resources

A New Approach for Text Documents Classification with Invasive Weed Optimization and Naive Bayes Classifier

With the rapid growth in the number of documents, Text Document Classification (TDC) methods have become crucial. This paper presents a hybrid model of Invasive Weed Optimization (IWO) and a Naive Bayes (NB) classifier (IWO-NB) for Feature Selection (FS), in order to reduce the size of the feature space in TDC. TDC includes different actions such as text processing, feature extraction, form...

Diagnosis of Pulmonary Tuberculosis Using Artificial Intelligence (Naive Bayes Algorithm)

Background and Aim: Despite the implementation of effective preventive and therapeutic programs, no significant success has been achieved in the reduction of tuberculosis. One of the reasons is the delay in diagnosis. Therefore, a diagnostic aid system can help diagnose tuberculosis early. The purpose of this research was to evaluate the role of the Naive Bayes algorithm as a...

The naive Bayes text classification algorithm based on rough set in the cloud platform

This paper improves the naive Bayesian classification algorithm: combining it with rough set theory yields a naive Bayesian classifier based on rough sets. We implement this algorithm on a cloud platform using the map-reduce programming model and get an excellent result. A recall rate of 76.4 was achieved when classifying Tibetan Web pages.

In silico prediction of anticancer peptides by TRAINER tool

Cancer is one of the causes of death in the world. Several treatment methods exist against cancer cells, such as radiotherapy and chemotherapy. Since traditional methods have side effects on normal cells and are expensive, identifying and developing new methods for cancer therapy is very important. Antimicrobial peptides, present in a wide variety of organisms, such as plants, amphibians and ...

Map-Reduce for Machine Learning on Multicore

We are at the beginning of the multicore era. Computers will have increasingly many cores (processors), but there is still no good programming framework for these architectures, and thus no simple and unified way for machine learning to take advantage of the potential speed up. In this paper, we develop a broadly applicable parallel programming method, one that is easily applied to many differe...

On Why Discretization Works for Naive-Bayes Classifiers

We investigate why discretization is effective in naive-Bayes learning. We prove a theorem that identifies particular conditions under which discretization will result in naive-Bayes classifiers delivering the same probability estimates as would be obtained if the correct probability density functions were employed. We discuss the factors that might affect naive-Bayes classification error under ...

Publication date: 2015